-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: MultiIndex._shallow_copy #32669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: MultiIndex._shallow_copy #32669
Conversation
d13a162
to
b22bca4
Compare
pandas/core/indexes/multi.py
Outdated
@@ -689,7 +690,7 @@ def __len__(self) -> int: | |||
# -------------------------------------------------------------------- | |||
# Levels Methods | |||
|
|||
@cache_readonly | |||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the levels are actually not readonly, bacise each level's name
attribute is writeable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I thought these were readonly as of 1.0 though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't mind treating it as immutable, it was just that the test using pandas/tests/indexes/multi/test_names.py::check_level_names
failed.
CategoricalIndex
and CategoricalDtype
actually have the same issue already:
>>> dtype = pd.CategoricalDtype(['a', 'b', 'c'])
>>> dtype.categories.name = "x" # dtypes a re immutable, this is not supposed to work
>>> dtype.categories
Index(['a', 'b', 'c'], dtype='object', name='x')
IMO this issue is a wart, but nothing serious, because MultiIndex.names
are now seperate from the levels, so it doesn't interfere with anything and the same issue is the same as with CategoricalIndex.categories.name
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the history here? Did we deliberately make this cache_readonly to avoid a performance regression.
I also don't see why check_level_names
would be failing when this is cache_readonly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it was changed in #31651 because of a performance regression.
I've made a new version that does a shallow_copy of _cache["levels"]
also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using cache_readonly
is faster than property
. I've updated the example to reflect that.
@@ -991,7 +992,13 @@ def _shallow_copy(self, values=None, **kwargs): | |||
# discards freq | |||
kwargs.pop("freq", None) | |||
return MultiIndex.from_tuples(values, names=names, **kwargs) | |||
return self.copy(**kwargs) | |||
|
|||
result = self.copy(**kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment pointing back to this thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
61dc8cd
to
3612860
Compare
Updated. It was needlessly complex to do ops in the |
3612860
to
9c5a56c
Compare
thanks @topper-123 very nice |
Yeah, this potentially makes copying thing around much cheaper. Will be interesting to see if there are some specific use cases where performances improve significantly. |
This PR has introduced a small bug. I'm not sure how deep it goes, but I've encountered it in import pandas as pd
idx = pd.MultiIndex.from_arrays([pd.PeriodIndex([pd.Period("2019Q1"), pd.Period("2019Q2")], name='b')])
idx2 = pd.MultiIndex.from_arrays([idx._get_level_values(level) for level in range(idx.nlevels)])
print(all(x.is_monotonic for x in idx2.levels)) # raises an error There's a simple fix for this, which involves changing |
Thanks for the report, @jacobaustin123. This is a bit tricky as you mention, but is caused by the use of This can be solved by deleting @jbrockmendel , I think you made the new |
Eventually, yes. The reason why PeriodIndex._engine exists is because PeriodEngine needs to get the So it should be removed, but there is some untangling to do first. |
Ok. a fix to the current problem seems to be to replace |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Improves performance of
MultiIndex._shallow_copy
. Example:Also adds tests for
_shallow_copy
for all index types. This ensures that this issue has been resolved for all index types.